Abstract


The real-time recognition of crowd dynamics in public places
is becoming essential to avoid crowd-related disasters and
ensure the safety of people. We present in this paper a new
approach for Crowd Event Recognition. Our study begins with a
novel tracking method, based on HOG descriptors, and then uses
pre-defined models (i.e. crowd scenarios) to recognize crowd
events. We define these scenarios using statistical analysis
of the data sets used in the experimentation. The approach is
characterized by combining a local analysis with a global
analysis for crowd behavior recognition. The local analysis is
enabled by a robust tracking method, and the global analysis is
done by a scenario modeling stage.


1. Introduction 


Just two decades ago, the computer vision community started
to focus on the study of crowds in public areas or during
public events [1]. This study is motivated by the increasing
need for public safety and the high risk of degeneration,
especially when a large number of people (a crowd) is involved.

In the research field related to crowd analytics we can find
different sub-topics such as crowd density estimation, crowd
tracking, face detection and recognition in crowds, and crowd
behavior analysis, among others. We are interested in crowd
behavior analysis, which is one of the newest areas in the
research community. Our goal is to automatically recognize
abnormal crowd events in video sequences. In general, the usual
process for activity analysis in a video sequence is composed
of the following three stages [4]: (1) detection, (2) tracking
and (3) event recognition. This process can be applied to
crowds as well as individuals.

We propose a new approach for crowd event recognition. The
paper considers the second and the third stages of the process
mentioned above, to improve the recognition stage. For this
purpose, in the tracking stage we compute feature points (i.e.
corner points) for every object detected in the first stage
(detection), using the FAST approach [2]. Then for each
computed feature point we build a descriptor based on the
Histogram of Oriented Gradients (HOG) [3], and we track these
feature points through their respective descriptors. Finally,
in the last stage (event recognition) we statistically analyze
the vectors formed by the tracking of the feature points, to
recognize a pre-defined event.


2. Previous Work 


Nowadays, there are many research works related to crowds.
The existing approaches in this field can be classified into
two categories [5]. One of them is related to crowd event
detection, and the other to crowd density estimation. Some
approaches in the second category are based on counting faces,
heads or persons [10, 11], but their performance is low when
there are occlusions. There are also approaches based on
texture and motion area ratio [6, 7, 8, 9], which are really
useful for crowd surveillance analysis. However, none of them
works for event recognition because they cannot detect abnormal
situations.

Most of the methods in the first category aim at detecting
abnormal events in crowd flows using motion patterns. Motion
patterns correspond either to normal behaviors (frequent
patterns) or abnormal behaviors (unusual patterns) [12, 13].
For example, the approach of Ihaddadene et al. [12] detects
abnormal motion variations using motion heat maps and optical
flow. They compute points of interest (POI) in the regions of
interest given by the maps. The variations of motion are
estimated to highlight potential abnormal events using a
pre-defined threshold. The approach does not need a huge amount
of data to learn pattern frequency, but it is necessary to
carefully define, in advance, an appropriate threshold and the
regions of interest for every scenario. Mehran et al. [13]
propose to use a social force model for the detection of
abnormal behaviors in crowds. The method consists in matching a
grid of particles with the frame and moving them along the
underlying flow field. Then the social force is computed
between moving particles to extract interaction forces, and the
ongoing behavior of the crowd is determined through the change
of interaction forces over time. The resulting vector field is
denoted as force flow, and is used to model the normal
behaviors. The method captures the dynamics of crowd behavior
without using object tracking or segmentation; nevertheless,
the obtained false positives could be problematic.

The tracking stage is another topic for the vision community.
In the literature we can find several approaches for object
tracking that try to solve the occlusion problem. Nevertheless,
handling occlusion when tracking people in a crowd is often a
harder problem than when tracking individuals. Most of the
methods for tracking individuals with occlusion may not scale
to crowds. One scalable method is KLT [15], which tracks
feature points, allowing multiple object tracking. Kaniche et
al. [16] propose a HOG tracker for gesture recognition, which
can be extended to multiple object tracking in crowds. They
select for each individual in the scene a set of points and
characterize them by computing 2D HOG descriptors; then they
track these descriptors to construct temporal HOG descriptors.

Our approach uses statistical pre-defined models of scenarios
to detect crowd events in video frames. The use of these
pre-defined models gives us a more flexible and general way to
model scenarios. We use object tracking, rather than a holistic
approach, to estimate crowd direction and speed because of its
higher accuracy. Other approaches also use object tracking, as
in [12] (optical flow); however, our approach is more robust
because we use HOG descriptors, which better characterize the
tracked points.


3. Crowd Tracking 


This section describes the tracking process for crowds through
the feature points computed for every object detected in a
frame. We briefly describe the object detection process, which
does not belong to our contribution.

To perform object detection we use the technique proposed by
Neghiem et al. [17] to calculate the difference between the
current image and the reference one (background). The idea is
to set up the moving regions by grouping neighbouring
foreground pixels, where moving regions are classified into
objects depending on their size (crowds, persons, groups, etc.).

Once the moving objects are detected in the scene using motion
segmentation, we track these objects by tracking their feature
points.


3.1 Feature Points 


After obtaining the detected moving objects in the current
frame, we compute for each of them a set of feature points to
track. For this, we use the FAST approach [2]. However, any
other corner detector could be applied, like the one proposed
by Shi et al. in [18]. Our method sorts the detected feature
points in descending order of corner strength. Then, from these
points (beginning with the most significant, i.e. the one with
the highest corner strength), a subset of feature points is
chosen to ensure a minimum distance between them, and also
between them and all tracked points in the corresponding
object. The minimum distance improves the feature point
distribution for an object and prevents mixing tracked points.
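
As a minimal sketch of the selection step described above, the following Python function keeps corner candidates greedily from strongest to weakest while enforcing the minimum distance. The function name, the tuple layout of the candidates and the parameter `min_dist` are hypothetical choices made for illustration; the corner detector itself (FAST or any other) is assumed to have already produced the candidates with their corner strengths.

```python
import math

def select_feature_points(candidates, tracked, min_dist):
    """Greedy selection of feature points for one object.

    candidates: list of (x, y, strength) from a corner detector.
    tracked:    list of (x, y) points already tracked on the object.
    min_dist:   minimum allowed distance in pixels (hypothetical name).
    """
    kept = list(tracked)      # points the new ones must stay away from
    selected = []
    # Visit candidates from the strongest to the weakest corner response.
    for x, y, _ in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
            selected.append((x, y))
    return selected
```

A weaker corner lying closer than `min_dist` to a stronger selected corner, or to an already tracked point, is discarded, which is what spreads the points over the object.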


3.2 2D HOG Descriptor 


We build a HOG descriptor [3] for each detected feature point.
To compute the descriptor we define around the feature point a
block of 9 cells (3 x 3), where a cell is a matrix of p x p
pixels (p ∈ {3, 5}). Then, we compute the approximate absolute
gradient magnitude (normalized) and the gradient orientation
for every pixel in the block using the Sobel operator. Using
the gradient orientation we assign each pixel of a cell to one
of the K orientation bins (by default K = 9). For each bin, we
compute the sum of the gradient magnitudes of its pixels.
Finally, we obtain for each cell inside a block a feature
vector of K orientation bins. The 2D descriptor is then a
vector for the whole block, concatenating the feature vectors
of all its cells normalized by p.
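
The descriptor construction can be sketched in plain Python as below. This is an illustration under stated assumptions, not the exact implementation: the magnitude is approximated by |gx| + |gy|, the orientation is taken unsigned in [0, 180) degrees, the per-pixel magnitude normalization mentioned in the text is left out because its form is not specified, and the image is a list of rows of grey values large enough that the block plus a one-pixel Sobel margin fits inside it. All function names are hypothetical.

```python
import math

def sobel(img, x, y):
    """Sobel derivatives (gx, gy) at pixel (x, y); img is a list of rows."""
    gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
          - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
    gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
          - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
    return gx, gy

def hog_descriptor(img, cx, cy, p=3, k=9):
    """Descriptor of length 9*k for the 3x3-cell block centred on (cx, cy)."""
    half = (3 * p) // 2                    # block is 3p pixels wide, p odd
    desc = []
    for cell_row in range(3):
        for cell_col in range(3):
            hist = [0.0] * k
            for dy in range(p):
                for dx in range(p):
                    x = cx - half + cell_col * p + dx
                    y = cy - half + cell_row * p + dy
                    gx, gy = sobel(img, x, y)
                    mag = abs(gx) + abs(gy)        # approximate magnitude
                    # Assumption: unsigned orientation in [0, 180) degrees.
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    hist[min(int(angle * k / 180.0), k - 1)] += mag
            desc.extend(h / p for h in hist)       # cell vector scaled by p
    return desc
```

With the default K = 9 the descriptor has 9 x 9 = 81 components, which matches the index range 1...9 x K used when descriptors are compared.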


3.3 Descriptor Tracking 


The feature points detected in the previous frame are tracked
in the current frame using the 2D HOG descriptors. In the
current frame we calculate the mean speed over the trajectory,
$\bar{S}$, of an object within a time window, using all speed
values from the feature points that belong to the same object.
If the feature point is newly detected in the current frame we
assume that $\bar{S} = S_{mean}$, where $S_{mean}$ is the mean
speed of the object at the current frame. To reduce the
processing time we use a search window defined by a search
radius. For a given feature point $F$, the search radius $R_s$
is computed as:

$$R_s = \bar{S} + \frac{1}{T} \times (S_{mean} - \bar{S}) \qquad (1)$$

where $T$ is the number of frames in which $F$ was tracked.
From equation (1), $R_s$ is more accurate when $F$ has a longer
track.
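
A small numeric sketch may help to read equation (1). It assumes the equation has the form "trajectory mean speed plus 1/T times (object mean speed minus trajectory mean speed)"; the argument names `s_traj`, `s_mean` and `t` are hypothetical.

```python
def search_radius(s_traj, s_mean, t):
    """Search radius from equation (1), under the reading stated above.

    s_traj: mean speed over the trajectory (pixels per frame).
    s_mean: mean speed of the object at the current frame.
    t:      number of frames in which the feature point was tracked (>= 1).
    """
    return s_traj + (s_mean - s_traj) / t
```

For a newly detected point the two speeds are equal, so the radius is simply the object mean speed. As `t` grows, the correction term shrinks and the radius approaches the trajectory mean speed, which is why a longer track gives a more reliable radius.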

The difference between two HOG descriptors, $d^n$ and $d^m$,
is defined by the equation:

$$E(d^n, d^m) = \sum_{i=1}^{9 \times K} \mathrm{MAX}(v_i^n, v_i^m) \times (d_i^n - d_i^m)^2 \qquad (2)$$

where $v^n$ and $v^m$ correspond to the variances of the HOG
descriptors $d^n$ and $d^m$, respectively, computed throughout
the time window.


Finally, we track $F$ in the current frame by comparing the
difference (equation (2)) of the HOG descriptors between $F$
and $r$ ($\forall r$, a point inside the window of radius
$R_s$). We choose the point $r'$ in the current frame which
best matches the point $F$ in the previous frame, i.e. the one
that minimizes the difference between their HOG descriptors.
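
The matching step can be sketched as follows. The sketch assumes that equation (2) weights the squared bin-wise difference by the larger of the two variances, and that candidate points inside the search window are supplied as (point, descriptor, variance) triples; both the triple layout and the function names are hypothetical.

```python
def descriptor_difference(d_n, v_n, d_m, v_m):
    """Variance-weighted difference between two HOG descriptors."""
    return sum(max(vn, vm) * (dn - dm) ** 2
               for dn, vn, dm, vm in zip(d_n, v_n, d_m, v_m))

def best_match(d_f, v_f, candidates):
    """Return the candidate point whose descriptor differs least from F.

    candidates: non-empty list of (point, descriptor, variance) triples
    for the points lying inside the search window.
    """
    best = min(candidates,
               key=lambda c: descriptor_difference(d_f, v_f, c[1], c[2]))
    return best[0]
```

Bins whose values fluctuated strongly over the time window get a larger weight, so a mismatch in those bins contributes more to the difference.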

We update the HOG descriptor of the tracked point by
computing the equation below:

$$d_i^F = (1 - \alpha) d_i^r + \alpha d_i^F, \quad i = 1 \ldots 9 \times K \qquad (3)$$

where $d_i^F$ is the mean HOG descriptor and $d_i^r$ is the
HOG descriptor of the point $r$ in the current frame. $\alpha$
is a cooling parameter. In the same way, we update the variance
of the mean descriptor bin in the current frame:

$$v_i^F = (1 - \alpha) \times |d_i^r - d_i^F| + \alpha v_i^F, \quad i = 1 \ldots 9 \times K \qquad (4)$$
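
The two update rules amount to exponential smoothing of the descriptor and of its variability. A minimal sketch is given below; it assumes that the absolute deviation in the variance update is measured against the mean descriptor before it is refreshed, which the text does not state explicitly, and the function name is hypothetical.

```python
def update_descriptor(d_f, v_f, d_r, alpha):
    """Update the mean descriptor d_f and its variance v_f with d_r.

    d_f, v_f: mean descriptor and variance of the tracked point.
    d_r:      descriptor of the matched point in the current frame.
    alpha:    cooling parameter in [0, 1]; larger values keep more history.
    """
    new_d = [(1 - alpha) * dr + alpha * df for df, dr in zip(d_f, d_r)]
    # Assumption: deviation measured against the mean before the update.
    new_v = [(1 - alpha) * abs(dr - df) + alpha * vf
             for df, vf, dr in zip(d_f, v_f, d_r)]
    return new_d, new_v
```

With `alpha` close to 1 the mean descriptor changes slowly and is robust to a single bad match; with `alpha` close to 0 it follows the latest observation.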


4. Crowd Event Recognition 


Crowd behavior can be characterized by regular motion pat- 
terns like direction, speed, etc. For this reason, the most ro- 
bust and simple approach for crowd event recognition is to 
use pre-defined models of crowd events. In this section, we 
explain the crowd motion information computed to define 
and recognize the different crowd events used in this study. 

Our approach consists in modeling crowd events through 
the information obtained with the tracking of the feature 
points. We rely on those motion vectors of feature points 
computed over multiple frames. For us a vector is a collec- 
tion of several elements which are the mean HOG descrip- 
tor, the start and end point of the trajectory of the tracked 
feature point, together with start and end time. The com- 
puted attributes (information) related to motion vectors are 
direction, speed, and crowd density. 

Direction is the property that identifies the direction of
the trajectory of feature points (called vectors). We divide
the Cartesian plane into 8 parts, where each part is a
direction between the angles [α, α + 45] with α ∈ {0, 45, 90,
135, 180, 225, 270, 315}, see Figure 1. The angle of the vector
is computed between the X axis (where x = 0 is the starting
direction of the vector) and the vector; this measure decides
into which of the 8 directions the vector is classified. After
this, we calculate the principal crowd directions considering
the density percentage of feature points in each direction. If
this percentage is bigger than a threshold t we assume there is
a crowd in that direction.
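
The direction attribute can be sketched as below. The sketch assumes that a vector is given by its start and end points, that bin b covers the angles from 45b to 45(b + 1) degrees, and that the threshold t is expressed as a fraction of all vectors in the frame. The function names are hypothetical, and in image coordinates (y pointing down) the bins are mirrored vertically with respect to the usual Cartesian drawing.

```python
import math

def direction_bin(start, end):
    """Index 0..7 of the 45-degree sector containing the vector."""
    angle = math.degrees(math.atan2(end[1] - start[1],
                                    end[0] - start[0])) % 360.0
    return int(angle // 45) % 8

def principal_directions(vectors, t):
    """Sectors holding more than a fraction t of the vectors.

    vectors: list of (start, end) point pairs for the current frame.
    """
    counts = [0] * 8
    for start, end in vectors:
        counts[direction_bin(start, end)] += 1
    total = len(vectors)
    return [b for b in range(8) if total and counts[b] / total > t]
```

A frame whose vectors mostly fall in one sector yields a single principal direction, while an evacuation-like frame yields several.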

Figure 1: Directions in the Cartesian Plane

The speed is directly related to the length of the vectors.
For each frame we calculate the speed of every vector
considering its length and the number of tracking frames of the
feature point associated with the vector. We obtain the crowd
average speed using the speed of all the vectors in the frame.
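
The speed computation can be sketched in a few lines, assuming that the speed of a vector is its Euclidean length divided by the number of frames over which the feature point was tracked, giving pixels per frame. The function names and the triple layout are hypothetical.

```python
import math

def vector_speed(start, end, n_frames):
    """Speed of one tracked feature point in pixels per frame."""
    return math.hypot(end[0] - start[0], end[1] - start[1]) / n_frames

def crowd_mean_speed(vectors):
    """Average speed over all vectors in a frame.

    vectors: list of (start, end, n_frames) triples.
    """
    speeds = [vector_speed(s, e, n) for s, e, n in vectors]
    return sum(speeds) / len(speeds) if speeds else 0.0
```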


For crowd density we build a grid over the image and then we
compute the density of feature points in each grid cell. This
information helps us to recognize the crowd events. Six crowd
events are modeled: walking, running, evacuation, local
dispersion, crowd formation and crowd splitting. The models
defined for this study are described below:


• Walking: corresponds to a significant number of
individuals moving at a low speed. We compute the mean
speed, measured in pixels per frame, considering all
vectors in a frame. We set up the threshold t1, and
when the mean speed is under this threshold we
recognize a crowd walking event.


• Running: corresponds to a significant number of
individuals moving at a high speed. We compute the mean
speed, measured in pixels per frame, considering all
vectors in a frame. We use the same threshold t1, but
when the mean speed is over t1 we recognize a crowd
running event.


• Evacuation: corresponds to a rapid dispersion of the
crowd in different directions. We use the attributes
direction and crowd density to recognize this event.
When there are more than 4 principal directions, when
the minimum distance between the principal directions
is over a threshold t2 (Euclidean distance between the
grid-cells containing the feature points related to
principal directions), and when the sum of the crowd
density per principal direction is over a threshold t3,
this event is recognized.


• Crowd Formation: corresponds to the merging of
several individuals, where the individuals approach from
different directions. Crowd density and the distance
between the principal directions are used to model this
event. We set up the thresholds t4 for the distance
between the principal directions, and t5 for the crowd
density in the respective grid-cells. When the minimum
distance is under t4 and the crowd density is over t5,
a crowd formation event is recognized.


• Crowd Splitting: corresponds to a cohesive crowd of
individuals which splits into two or more flows. The
crowd density and the distance between the principal
directions are used to model this event. We set up the
thresholds t6 for the distance between the main
directions, and t7 for the crowd density in the
respective grid-cells. When the maximum distance is over
t6 and the crowd density is under t7, a crowd splitting
event is recognized.


• Local Dispersion: corresponds to localized movement
of people within a crowd away from a given threat.
This event is very similar to crowd formation/splitting
because its model uses the same attributes, plus one
more: the speed. Nevertheless, the thresholds (also
used for crowd formation/splitting) are different.
Moreover, the distance between the grid-cells has to be
over a threshold t8 and the crowd density has to be
distributed between the grid-cells with more than one
principal direction. The mean speed has to be under a
threshold t9.
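
Some of the rules above can be sketched as simple predicates. The sketch assumes that the density of a grid cell is the fraction of feature points falling into it, that the position of a principal direction is summarized by a grid-cell coordinate, and that a mean speed exactly equal to the threshold counts as running; none of these choices is fixed by the text. All names and threshold arguments are hypothetical, and the threshold values would have to be set experimentally.

```python
import math
from itertools import combinations

def grid_density(points, cell_size):
    """Fraction of feature points falling in each grid cell."""
    counts = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        counts[key] = counts.get(key, 0) + 1
    return {cell: n / len(points) for cell, n in counts.items()}

def speed_event(mean_speed, speed_thresh):
    """Walking below the threshold, running otherwise."""
    return "walking" if mean_speed < speed_thresh else "running"

def pairwise_distances(cells):
    """Euclidean distances between the grid cells of principal directions."""
    return [math.dist(a, b) for a, b in combinations(cells, 2)]

def is_crowd_formation(cells, density, dist_thresh, density_thresh):
    """Principal directions close together and a dense crowd."""
    d = pairwise_distances(cells)
    return bool(d) and min(d) < dist_thresh and density > density_thresh

def is_crowd_splitting(cells, density, dist_thresh, density_thresh):
    """Principal directions far apart and a sparse crowd."""
    d = pairwise_distances(cells)
    return bool(d) and max(d) > dist_thresh and density < density_thresh
```

Each predicate needs at least two principal directions, since a distance between directions is otherwise undefined. The evacuation and local dispersion rules follow the same pattern with their additional conditions.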


5. Experimental Results 


To validate our approach we have tested it on the PETS dataset
S3, High Level, which contains four sequences with timestamps
14:16, 14:27, 14:31 and 14:33, respectively. For each sequence
we use the videos recorded by camera 1 (View 1), and we
consider that there are two video clips inside the sequences
14:16, 14:27 and 14:33 and one video clip for the sequence
14:31. A video clip is about 130 frames long. The videos depict
the 6 crowd scenarios described in the previous section. The
crowd scenarios are acted by about 40 people from the Reading
University campus. All the experiments have been performed on
one view and our plan is to complete the experiments on the
other views.

The thresholds used in the event models have been set 
up experimentally. We are currently designing a learning 
process to compute and optimize the thresholds. 

Table 1 presents some measures to evaluate our approach: true
positives (TP), false positives (FP) and sensitivity (SN). We
consider a TP to be a recognized crowd event that matches the
ground truth for a frame, and an FP to be a recognized crowd
event that does not match it. SN is defined as TP/(TP + FN),
where FN denotes the false negatives. Since the ground truth is
not established for S3 High Level, we have built the ground
truth manually.
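
The per-frame counting can be sketched as follows. The sketch assumes that each frame carries a set of recognized event labels and a set of ground-truth labels, and that a false negative is a ground-truth event that was not recognized in that frame; the text does not define FN, so this is an assumption, and the function name is hypothetical.

```python
def evaluate(recognized, ground_truth):
    """Count TP, FP, FN per frame and compute the sensitivity.

    recognized, ground_truth: lists of sets of event labels,
    one set per frame, of equal length.
    """
    tp = fp = fn = 0
    for rec, gt in zip(recognized, ground_truth):
        tp += len(rec & gt)     # recognized and present in ground truth
        fp += len(rec - gt)     # recognized but absent from ground truth
        fn += len(gt - rec)     # present in ground truth but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, sn
```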

Table 2 contains the number of frames of each of the 7 video
clips.


Table 1: Measures to evaluate the approach

Crowd event    TP     FP     SN
[illegible]    637    430    0.60
Walking        976    90     0.92


Table 2: Frame number for each Video Clip

:16- 107


Table 3 shows the significant time intervals where the
pre-defined events were recognized for the 7 video clips. The
columns are the different videos. There are 6 rows which
represent the crowd scenarios in our study. Each element of the
table contains the frames where the event is recognized in the
corresponding video clip. The video clips named time_stamp-B
are the continuation of the video sequence time_stamp, i.e. if
the last frame of time_stamp-A is 104, the first frame of
time_stamp-B is 105. Inside the brackets, two time intervals
are separated by ";". A time interval is significant when its
size is bigger than 9 frames. False positives of crowd events
can be detected as significant time intervals.
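
The notion of a significant time interval can be sketched as below, assuming that the recognizer outputs the sorted frame numbers in which an event was recognized and that the size of an interval is its number of frames. The function name and the parameter `min_size` are hypothetical.

```python
def significant_intervals(frames, min_size=9):
    """Group consecutive frame numbers and keep the long runs.

    frames:   sorted frame numbers in which an event was recognized.
    min_size: an interval is kept when it has more than this many frames.
    """
    intervals = []
    start = prev = None
    for f in frames:
        if start is None:
            start = prev = f
        elif f == prev + 1:
            prev = f                      # still inside the current run
        else:
            intervals.append((start, prev))
            start = prev = f              # a gap starts a new run
    if start is not None:
        intervals.append((start, prev))
    return [(a, b) for a, b in intervals if b - a + 1 > min_size]
```

Short bursts of recognition are thereby filtered out, although a false positive that lasts longer than the minimum size still passes the filter.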

Figure 2 shows some illustrations of the results of our 
approach. The black lines are the trajectories of the tracked 
feature points depicting their direction and length. 


6. Conclusion 


In this paper we have presented a novel approach for
recognizing crowd events. The contribution is the combination
of local and global analysis. The local analysis is achieved by
tracking HOG descriptors and the global analysis is obtained by
statistical analysis of the HOG motion patterns. Also, the use
of HOG descriptors for tracking enables a high accuracy in
crowd event recognition and a better characterization of
feature points. The approach has been successfully validated on
the PETS dataset. There are still some errors in the recognized
events. These errors are mainly due to the setting of the
thresholds at the level of scenario models. For future work we
plan to improve the threshold computation by automating the
construction of scenario models. We are also currently
computing the HOG motion vectors in 3D so that the approach is
independent of the scene. The scenario models (besides the
thresholds) are easy to model by users and can be extended to
other crowd scenarios. The definition of a language for
modeling these scenarios can also enhance the flexibility of
the approach to pre-define the scenarios.


Table 3: Time Intervals for Crowd Events Recognized

14:16-A 14:16-B 14:27-A 14:27-B 14:31 14:33-A 14:33-B
(21, a [25:45 ; 2 167] | [201 ae a 313] | [997] | [69,157 ; 253,310] [332,341]
Crowd Splitting [98,130] | [12,52; 158,251] [363,377]
Walking [1, 7 [109, TEE om 222] [1: Er A 55 [1,130] [1,310] [313,341 ; 348,377]
Daa PR
Local Dispersion | 11 [| 0 À ASS O [ER] HS | UN


Figure 2: The first row presents the original frames and the second row the output of our approach


